2020 Machine Learning class competition @ UT Austin iSchool
Won second place in the ML class competition with 58% accuracy.
VizWiz is a dataset built to help people who are blind overcome their everyday visual challenges through computer vision and AI. It contains data submitted by users of a mobile phone application, each of whom took a picture and (optionally) recorded a spoken question about that picture. Website
When a blind user takes a picture and asks a question about it, there are many reasons the question may turn out to be unanswerable. To tackle this issue, we extracted features from both the images and the questions and built a model to predict whether each question can be answered. VQA Challenge
In the feature extraction step, I use four image-based features: blur value, background color, foreground color, and image tags. I also use three question-based features: key phrases, sentiment value, and the first word of the question.
To make all of the features usable as model inputs, I transform each one separately. I use a one-hot encoder to convert the foreground color, background color, and first word of the question into 0/1 values, and I round the blur value and sentiment value to three decimal places. In addition, I compare the image tags with the key phrases of the question: if any tag of the image appears in the question, the feature value is 1; otherwise it is 0.
I chose the blur feature because an image that is too unclear often cannot be answered, so the blur value should be a useful predictor. Similarly, I chose the color features because some pictures are too dark or too bright to be answered. Among the question-based features, the sentiment value may be useful because the tone of the wording can affect whether an answer is obtainable, and the first word of the question, like What, Where, and Why, can also indicate whether an answer is likely.
Last but not least, I compare the image tags with the key phrases of the question because if the question asks about something that actually appears in the image, it should be easier to answer; a toy sketch of this feature follows.
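As a minimal sketch (the tags and keywords below are made up for illustration, not taken from the dataset), the tag-matching feature works like this:

# Hypothetical example of the tag/key-phrase match feature
image_tags = ['can', 'soda', 'table']          # words from the Vision API tags
question_keywords = ['kind', 'soda', 'can']    # words from the Text Analytics key phrases
keyFeature = 1 if any(tag in question_keywords for tag in image_tags) else 0
print(keyFeature)  # prints 1, because 'soda' (and 'can') appear in both lists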
To train the classification model, I prepared 2000 training examples, 300 validation examples, and 100 test examples. During training, I used cross-validation to estimate each model's overall performance and then checked the accuracy on the validation set.
For the choice of classification model, I tried several classifiers taught in class and compared their accuracy: KNN, decision tree, SVM, Naive Bayes, and a neural network. The results were all similar, with accuracy only around 0.55.
After that, I also tried ensemble methods such as voting and AdaBoost. The voting model performed about the same as the five individual classifiers, but AdaBoost reached 0.58, slightly higher than the others. Therefore, I chose the AdaBoost model to predict on my test data in the end.
# Framework for lab 3: predicting whether a question about an image can be answered
img_dir = "https://ivc.ischool.utexas.edu/VizWiz_visualization_img/"
split = 'train'
#split = 'val'
#split = 'test'
annotation_file = 'https://ivc.ischool.utexas.edu/VizWiz_final/vqa_data/Annotations/%s.json' %(split)
print(annotation_file)
Image-based features
from skimage import io, color
from skimage.transform import resize
import skimage
import skimage.feature as feature
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import numpy as np
import cv2
from google.colab.patches import cv2_imshow

# Microsoft Azure Computer Vision API configuration
subscription_key = '43e430a1bf9443e28c37ef13aad0baf2'
vision_base_url = 'https://southcentralus.api.cognitive.microsoft.com/vision/v1.0'
vision_analyze_url = vision_base_url + '/analyze?'
# Evaluate an image using the Microsoft Vision API
def analyze_image(image_url):
    # Visualize image
    image = io.imread(image_url)
    # plt.imshow(image)
    # plt.axis('off')
    # plt.show()
    # Microsoft API headers, params, etc.
    headers = {'Ocp-Apim-Subscription-key': subscription_key}
    params = {'visualfeatures': 'Adult,Categories,Description,Color,Faces,ImageType,Tags'}
    data = {'url': image_url}
    # Send request, get API response
    response = requests.post(vision_analyze_url, headers=headers, params=params, json=data)
    response.raise_for_status()
    analysis = response.json()
    return analysis
# Calculate a blur value as the variance of the Laplacian
def variance_of_laplacian(img_url):
    image = io.imread(img_url)
    width = 255
    height = 255
    image = resize(image, (width, height))
    greyscale_image = skimage.color.rgb2gray(image)
    fm = round(cv2.Laplacian(greyscale_image, cv2.CV_64F).var() * 50, 3)
    return fm
def extract_image_features(image_url):
    # Get the Azure Computer Vision analysis result for the picture
    data = analyze_image(image_url)
    # Dominant foreground color in the picture
    foreColor = ''
    if len(data['color']['dominantColorForeground']) == 0:
        foreColor = 'no'
    else:
        foreColor = str(data['color']['dominantColorForeground'])
    # Dominant background color in the picture
    backColor = ''
    if len(data['color']['dominantColorBackground']) == 0:
        backColor = 'no'
    else:
        backColor = str(data['color']['dominantColorBackground'])
    # Key tags in the picture, split into individual words
    keyword = []
    for i in range(len(data['tags'])):
        x = data['tags'][i]['name'].split(' ')
        for j in x:
            keyword.append(str(j))
    keyword = np.array(keyword)
    # Blur value of the picture
    blur = variance_of_laplacian(image_url)
    return foreColor, backColor, keyword, blur
Question-based features
def analyze_question(question):
    dic = {"documents": [{"id": 1, "text": question}]}
    # print(question)
    # Azure Text Analytics API configuration
    subscription_key = '43e430a1bf9443e28c37ef13aad0baf2'
    endpoint = 'https://southcentralus.api.cognitive.microsoft.com'
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    # Sentiment score of the question
    sentiment_url = endpoint + "/text/analytics/v2.1/sentiment"
    response = requests.post(sentiment_url, headers=headers, json=dic)
    sentiments = response.json()
    sentimentsValue = round(sentiments['documents'][0]['score'], 3)
    # pprint(sentiments)
    # Key phrases of the question, split into individual words
    key = []
    keyphrase_url = endpoint + "/text/analytics/v2.1/keyphrases"
    response = requests.post(keyphrase_url, headers=headers, json=dic)
    key_phrases = response.json()
    if len(key_phrases['documents'][0]['keyPhrases']) != 0:
        for i in key_phrases['documents'][0]['keyPhrases']:
            x = i.split(' ')
            for j in x:
                key.append(j)
    return sentimentsValue, key
def extract_question_features(question):
    # Get the Azure text analysis result for the question
    sentiments, keyword = analyze_question(question)
    # Extract the first word of the question (e.g. What, Where, Why)
    Qword = question.split(' ')[0]
    return sentiments, keyword, Qword
Extract features and combine them
# Read the file to extract each dataset example with its label
import requests
import numpy as np

split_data = requests.get(annotation_file, allow_redirects=True)
num_VQs = 2000
k = 0
data = split_data.json()
X = []
y = []
foregroundColor = []
backgroundColor = []
questionWord = []
for vq in data[0:num_VQs]:
    # Extract features describing the image
    image_name = vq['image']
    image_url = img_dir + image_name
    forecolor, backcolor, key_vision, blur = extract_image_features(image_url)
    print(forecolor, backcolor, key_vision, blur)
    # Extract features describing the question
    question = vq['question']
    sentiments, key_text, Qword = extract_question_features(question)
    print(sentiments, key_text, Qword)
    # Check whether any image tag also appears in the question's key phrases
    keyFeature = 0
    matchKey = [i for i in key_vision if i in key_text]
    if len(matchKey) > 0:
        keyFeature = 1
    # Create a multimodal feature vector representing both the image and the question
    multimodal_features = np.array([blur, sentiments, keyFeature])
    # Prepare features and labels
    X.append(multimodal_features)
    label = vq['answerable']
    y.append(label)
    print(k, multimodal_features, label)
    k += 1
    # Categorical features are collected separately and one-hot encoded below
    foregroundColor.append(forecolor)
    backgroundColor.append(backcolor)
    questionWord.append(Qword)
    # print(image_name)
    # print(question)
    # print(label)
    # print(multimodal_features)
    # visualize_image(image_url)
One Hot Encoder
from sklearn.preprocessing import OneHotEncoder

def oneHotTransform(Tarray):
    # One-hot encode a single categorical feature column
    enc = OneHotEncoder()
    a = np.reshape(Tarray, (-1, 1))
    enc.fit(a)
    ans = enc.transform(a).toarray()
    return ans

forecolorFeature = oneHotTransform(foregroundColor)
backcolorFeature = oneHotTransform(backgroundColor)
QwordFeature = oneHotTransform(questionWord)
# Stack the numeric features with the one-hot encoded categorical features
X = np.concatenate((X, forecolorFeature, backcolorFeature, QwordFeature), axis=1)
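The model code below assumes that X_train/Y_train, X_val/Y_val, and X_test_reduced already exist, but the split itself is not shown above. A minimal sketch of one way to produce them, assuming the extraction and encoding cells are rerun once per split using the split variable defined at the top ('train', 'val', 'test'), with the resulting arrays saved each time; the assignments here are hypothetical:

# Hypothetical sketch: collect the features once per dataset split.
# Rerun the cells above with split = 'train' (2000 examples), then 'val' (300), then 'test' (100),
# saving the arrays each time. Note the one-hot columns must line up across splits
# (e.g. by fitting the encoders on the same category sets).
X_train, Y_train = np.array(X), np.array(y)    # after running with split = 'train'
# X_val, Y_val   = np.array(X), np.array(y)    # after running with split = 'val'
# X_test         = np.array(X)                 # after running with split = 'test'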
PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=25)
pca.fit(X_train)
X_train_reduced = pca.transform(X_train)
X_val_reduced = pca.transform(X_val)
X_test_reduced = pca.transform(X_test)
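The notebook does not show how n_components=25 was chosen; as a quick check (not in the original notebook), the variance retained by the 25 components can be inspected after fitting:

# Fraction of the total variance kept by the 25 principal components (sanity check)
print(pca.explained_variance_ratio_.sum())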
KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import classification_report

training_precision_manhattan = []
training_precision_euclidean = []
best_precision = 0
# Grid search over distance metric (p=1 Manhattan, p=2 Euclidean) and number of neighbors
for i in range(1, 3):
    neighbor_setting = range(3, 20)
    for curKvalue in neighbor_setting:
        knn_clf = KNeighborsClassifier(n_neighbors=curKvalue, p=i)
        kfold_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
        fold_train_precision = cross_val_score(knn_clf, X_train_reduced, Y_train, cv=kfold_shuffled, scoring='precision')
        cur_train_precision = fold_train_precision.mean()
        if cur_train_precision > best_precision:
            best_param = {'p': i, 'n_neighbors': curKvalue}
            best_precision = cur_train_precision
        if i == 1:
            training_precision_manhattan.append(cur_train_precision)
        else:
            training_precision_euclidean.append(cur_train_precision)

# Refit with the best hyperparameters and evaluate on the validation set
knn_clf = KNeighborsClassifier(**best_param)
knn_clf.fit(X_train_reduced, Y_train)
knn_pred = knn_clf.predict(X_val_reduced)
print(classification_report(Y_val, knn_pred))
Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

training_precision_gini = []
training_precision_entropy = []
best_precision = 0
# Grid search over split criterion and maximum tree depth
for i in ['gini', 'entropy']:
    tree_setting = range(3, 20)
    for value in tree_setting:
        tree_clf = DecisionTreeClassifier(criterion=i, max_depth=value)
        kfold_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        fold_train_precision = cross_val_score(tree_clf, X_train_reduced, Y_train, cv=kfold_shuffled, scoring='precision')
        cur_train_precision = fold_train_precision.mean()
        if cur_train_precision > best_precision:
            best_param = {'criterion': i, 'max_depth': value}
            best_precision = cur_train_precision
        if i == 'gini':
            training_precision_gini.append(cur_train_precision)
        else:
            training_precision_entropy.append(cur_train_precision)

# Refit with the best hyperparameters and evaluate on the validation set
tree_clf = DecisionTreeClassifier(**best_param)
tree_clf.fit(X_train_reduced, Y_train)
tree_pred = tree_clf.predict(X_val_reduced)
print(classification_report(Y_val, tree_pred))
SVM
from sklearn.svm import SVC

# Two manual gamma values close to sklearn's 'scale' heuristic (1 / (n_features * X.var()))
G1 = 1 / (X_train_reduced.var() * X_train_reduced[1].size) + 0.00001
G2 = 1 / (X_train_reduced.var() * X_train_reduced[1].size) + 0.00002
best_precision = 0
# Grid search over polynomial degree, regularization C, and gamma
for curD in range(2, 6):
    for curC in [0.1, 1, 10]:
        for curG in ['scale', G1, G2]:
            param = {'C': curC, 'degree': curD, 'gamma': curG}
            print(param)
            svm_clf = SVC(kernel='poly', degree=curD, C=curC, gamma=curG)
            kfold_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
            fold_train_precision = cross_val_score(svm_clf, X_train_reduced, Y_train, cv=kfold_shuffled, scoring='precision')
            precision = fold_train_precision.mean()
            if precision > best_precision:
                best_param = {'C': curC, 'degree': curD, 'gamma': curG}
                best_precision = precision

# Refit a polynomial-kernel SVM (the kernel used in the search) with the best hyperparameters
svm_clf = SVC(kernel='poly', **best_param)
svm_clf.fit(X_train_reduced, Y_train)
svm_pred = svm_clf.predict(X_val_reduced)
print(classification_report(Y_val, svm_pred))
Naive Bayes
from sklearn.naive_bayes import GaussianNB
gaussian_model = GaussianNB()
gaussian_model.fit(X_train_reduced, Y_train)
bayes_pred = gaussian_model.predict(X_val_reduced)
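The original notebook stops here for Naive Bayes; for consistency with the other classifiers, the same validation report could be printed (assuming the classification_report import from the KNN section):

print(classification_report(Y_val, bayes_pred))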
Neural Network
from sklearn.neural_network import MLPClassifier

d_hidden = []
d_nn_acc = []
# Try one to five hidden layers with 64 to 320 nodes per layer
for i in range(1, 6):
    n_hidden_nodes = 64 * i
    acc = []
    for j in range(1, 6):
        layers = [n_hidden_nodes] * j
        mlp = MLPClassifier(activation='tanh', hidden_layer_sizes=layers, max_iter=20, verbose=False)
        mlp.fit(X_train_reduced, Y_train)
        acc.append(mlp.score(X_val_reduced, Y_val))
        print(n_hidden_nodes, j, mlp.loss_, mlp.score(X_val_reduced, Y_val))
    d_hidden.append(i)
    d_nn_acc.append(acc)

# Refit the chosen architecture (five layers of 320 nodes) and evaluate on the validation set
n_hidden_nodes = 320
layers = [n_hidden_nodes] * 5
mlp = MLPClassifier(activation='tanh', hidden_layer_sizes=layers, max_iter=20, verbose=False)
mlp.fit(X_train_reduced, Y_train)
acc = mlp.score(X_val_reduced, Y_val)
mlp_pred = mlp.predict(X_val_reduced)
Voting
from sklearn.ensemble import VotingClassifier

# Hard-voting ensemble of the five classifiers tuned above
eclf = VotingClassifier(estimators=[('knn', knn_clf), ('dt', tree_clf), ('svm', svm_clf), ('nb', gaussian_model), ('nn', mlp)], voting='hard')
eclf.fit(X_train_reduced, Y_train)
vote_pred = eclf.predict(X_val_reduced)
print(classification_report(Y_val, vote_pred))
AdaBoost
from sklearn.ensemble import AdaBoostClassifier

adabooster = AdaBoostClassifier(n_estimators=40)
adabooster.fit(X_train_reduced, Y_train)
adabooster_pred = adabooster.predict(X_val_reduced)
print(classification_report(Y_val, adabooster_pred))
import csv

# Predict on the test split and write one prediction per row
predictions = adabooster.predict(X_test_reduced)
# f = open("results.csv", mode="w")
with open("/content/drive/My Drive/Colab Notebooks/results.csv", mode="w") as f:
    results = csv.writer(f)
    for prediction in predictions:
        results.writerow([prediction])